TheanoLM - An Extensible Toolkit for Neural Network Language Modeling
We present a new tool for training neural network language models (NNLMs),
scoring sentences, and generating text. The tool has been written using the
Python library Theano, which allows researchers to easily extend it and tune
any aspect of the training process. Despite this flexibility, Theano is able to
generate extremely fast native code that can utilize a GPU or multiple CPU
cores in order to parallelize the heavy numerical computations. The tool has
been evaluated in difficult Finnish and English conversational speech
recognition tasks, and significant improvement was obtained over our best
back-off n-gram models. The results that we obtained in the Finnish task were
compared to those from the existing RNNLM and RWTHLM toolkits, and found to be
as good or better, while training times were an order of magnitude shorter.
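Theano itself is no longer maintained, but the core operation of an NNLM sentence scorer is easy to illustrate. The sketch below, in plain NumPy with hypothetical toy weights (a real NNLM would use trained parameters and recurrent layers), sums log-probabilities of each word given its predecessor through an embedding, a projection, and a softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3, "</s>": 4}
V, D = len(vocab), 8  # vocabulary size, embedding dimension

# Toy parameters standing in for a trained NNLM (random here).
E = rng.normal(size=(V, D))   # input word embeddings
W = rng.normal(size=(D, V))   # output projection to vocabulary logits

def score_sentence(words):
    """Sum of log P(w_t | w_{t-1}) under a minimal bigram-style NNLM."""
    ids = [vocab["<s>"]] + [vocab[w] for w in words] + [vocab["</s>"]]
    logp = 0.0
    for prev, cur in zip(ids, ids[1:]):
        logits = E[prev] @ W
        logits -= logits.max()                         # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax over vocabulary
        logp += np.log(probs[cur])
    return logp

print(score_sentence(["the", "cat", "sat"]))
```

The same loop is what rescoring does in practice: each recognizer hypothesis is scored and the log-probabilities are interpolated with the acoustic and n-gram scores.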
Using stacked transformations for recognizing foreign accented speech
A common problem in speech recognition for foreign-accented speech is that there is not enough training data for an accent-specific or a speaker-specific recognizer. Speaker adaptation can be used to improve the accuracy of a speaker-independent recognizer, but a lot of adaptation data is needed for speakers with a strong foreign accent. In this paper we propose a rather simple and successful technique of stacked transformations, where the baseline models trained for native speakers are first adapted using accent-specific data and then by another transformation using speaker-specific data. Because the accent-specific data can be collected offline, the first transformation can be more detailed and comprehensive, and the second one less detailed and fast. Experimental results are provided for speaker adaptation in English spoken by Finnish speakers. The evaluation results confirm that the stacked transformations are very helpful for fast speaker adaptation.
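The stacking idea can be sketched with affine feature transforms (in the style of CMLLR-type adaptation); the matrices and offsets below are hypothetical toy values, not estimated from data. The accent-level transform is applied first, and the lighter speaker-level transform is composed on top:

```python
import numpy as np

# Hypothetical affine transforms: an accent-level transform (estimated
# offline on pooled accent-specific data) and a speaker-level transform
# (estimated quickly from a little speaker-specific data).
A_accent  = np.array([[1.1, 0.0], [0.0, 0.9]])
b_accent  = np.array([0.2, -0.1])
A_speaker = np.eye(2) * 1.05
b_speaker = np.array([0.0, 0.05])

def adapt(x):
    """Stacked transformations: accent transform first, speaker transform on top."""
    x = A_accent @ x + b_accent      # stage 1: accent-specific adaptation
    x = A_speaker @ x + b_speaker    # stage 2: fast speaker-specific adaptation
    return x

print(adapt(np.array([1.0, 1.0])))
```

Because the two stages compose, only the small second-stage transform needs to be re-estimated per speaker.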
Indexing Audio Documents by using Latent Semantic Analysis and SOM
This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights. Indexing methods are evaluated for two broadcast news databases (one French and one English) using the average document perplexity defined in this paper and test queries analyzed by human experts.
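The latent-semantic part of such an indexer can be sketched in a few lines: a truncated SVD of the term-document count matrix projects documents into a low-dimensional semantic space where topically similar documents are close. The matrix below is a toy example, not data from the paper:

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
# Docs 0-1 share "election"/"vote" terms; docs 2-3 share "storm"/"rain".
X = np.array([
    [2, 1, 0, 0],   # "election"
    [1, 2, 0, 0],   # "vote"
    [0, 0, 3, 1],   # "storm"
    [0, 0, 1, 3],   # "rain"
], dtype=float)

# Latent semantic analysis: truncated SVD to a k-dimensional subspace.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents about the same topic end up close in the latent space.
print(cosine(doc_vecs[0], doc_vecs[1]), cosine(doc_vecs[0], doc_vecs[2]))
```

Queries are projected into the same subspace and matched against document vectors by cosine similarity.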
Indexing spoken audio by LSA and SOMs
This paper presents an indexing system for spoken audio documents. The framework is indexing and retrieval of broadcast news. The proposed indexing system applies latent semantic analysis (LSA) and self-organizing maps (SOM) to map the documents into a semantic vector space and to display the semantic structures of the document collection. The SOM is also used to enhance the indexing of documents that are difficult to decode. Relevant index terms and suitable index weights are computed by smoothing the document vectors with other documents that are close to them in the semantic space. Experimental results are provided using the test data of the TREC spoken document retrieval (SDR) track.
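The smoothing step can be illustrated with a minimal sketch (toy vectors, simple nearest-neighbor averaging rather than an actual SOM): a poorly decoded document's vector is mixed with the mean of its closest neighbors in the semantic space, so index weights borrow evidence from related documents.

```python
import numpy as np

# Toy semantic-space document vectors; index 2 is a short, noisy document.
docs = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.5, 0.5],   # poorly decoded document to be smoothed
    [0.0, 1.0],
])

def smooth(vecs, idx, n_neighbors=2, weight=0.5):
    """Mix a document vector with the mean of its nearest neighbors."""
    dists = np.linalg.norm(vecs - vecs[idx], axis=1)
    order = np.argsort(dists)[1:1 + n_neighbors]   # skip the document itself
    return weight * vecs[idx] + (1 - weight) * vecs[order].mean(axis=0)

print(smooth(docs, 2))
```

In the paper the neighbors come from the SOM's map units, which also gives the visualization of the collection for free.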
SPEECH RECOGNITION
Automatic speech recognition is a major application of statistical and learning-based pattern recognition methods, and it holds a special position also because of its numerous applications that are familiar even to ordinary people. Although speech recognition is easy for humans, it is nevertheless a very challenging problem due to the richness and diversity of the audio signal and of language. This article briefly presents the operating principles of modern speech recognizers, the mathematical foundations of the solutions, and the performance of the recognizers. In addition, a short overview of speech recognition research in Finland is given. Keywords: speech recognition
Fast latent semantic indexing of spoken documents by using self-organizing maps
This paper describes a new latent semantic indexing (LSI) method for spoken audio documents. The framework is indexing broadcast news from radio and TV as a combination of large vocabulary continuous speech recognition (LVCSR), natural language processing (NLP) and information retrieval (IR). For indexing, the documents are presented as vectors of word counts, whose dimensionality is rapidly reduced by random mapping (RM). The obtained vectors are projected into the latent semantic subspace determined by SVD, where the vectors are then smoothed by a self-organizing map (SOM). The smoothing by the closest document clusters is important here, because the documents are often short and have a high word error rate (WER). As the clusters in the semantic subspace reflect the news topics, the SOMs provide an easy way to visualize the index and query results and to explore the database. Test results are reported for TREC's spoken document retrieval databases
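The random mapping step is worth a concrete sketch: instead of running SVD directly on the huge vocabulary-sized count vectors, the vectors are first multiplied by a fixed random matrix, which approximately preserves pairwise distances (a Johnson-Lindenstrauss-style argument) at a fraction of the cost. Sizes below are toy values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Word-count document vectors with a large vocabulary dimension (toy sizes).
n_docs, vocab_size, reduced_dim = 100, 5000, 100
X = rng.poisson(0.01, size=(n_docs, vocab_size)).astype(float)

# Random mapping: project through a fixed random matrix. Pairwise distances
# are approximately preserved, and the reduced matrix is cheap enough that
# the subsequent SVD into the latent semantic subspace becomes tractable.
R = rng.normal(size=(vocab_size, reduced_dim)) / np.sqrt(reduced_dim)
X_reduced = X @ R

print(X_reduced.shape)
```

The SVD and SOM smoothing described in the abstract then operate on `X_reduced` rather than on the raw count matrix.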
Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian
We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models.
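The standard class-based factorization the abstract builds on decomposes a word bigram probability into a class-transition term and a class-membership term, P(w_i | w_{i-1}) = P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)). A minimal sketch with hypothetical classes and probabilities (the words and values below are illustrative, not from the paper's models):

```python
# Class-based bigram: P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i).
# Toy Finnish-flavoured vocabulary with hypothetical probabilities.
word_class = {"talo": "NOUN", "talossa": "NOUN", "juoksee": "VERB"}
p_class_bigram = {("NOUN", "VERB"): 0.4, ("VERB", "NOUN"): 0.3}
p_word_given_class = {"talo": 0.2, "talossa": 0.1, "juoksee": 0.5}

def class_bigram_prob(prev_word, word):
    """Probability of `word` following `prev_word`, routed via their classes."""
    c_prev, c = word_class[prev_word], word_class[word]
    return p_class_bigram.get((c_prev, c), 0.0) * p_word_given_class[word]

print(class_bigram_prob("talo", "juoksee"))   # 0.4 * 0.5
```

With millions of word types, the class-transition table is vastly smaller than a word-level bigram table, which is exactly the sparsity and computation argument made above; the merging, splitting and exchange procedures search for the class assignment `word_class` that maximizes training-data likelihood.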
On Using Distribution-Based Compositionality Assessment to Evaluate Compositional Generalisation in Machine Translation
Compositional generalisation (CG), in NLP and in machine learning more
generally, has been assessed mostly using artificial datasets. It is important
to develop benchmarks to assess CG also in real-world natural language tasks in
order to understand the abilities and limitations of systems deployed in the
wild. To this end, our GenBench Collaborative Benchmarking Task submission
utilises the distribution-based compositionality assessment (DBCA) framework to
split the Europarl translation corpus into a training and a test set in such a
way that the test set requires compositional generalisation capacity.
Specifically, the training and test sets have divergent distributions of
dependency relations, testing NMT systems' capability of translating
dependencies that they have not been trained on. This is a fully-automated
procedure to create natural language compositionality benchmarks, making it
simple and inexpensive to apply it further to other datasets and languages. The
code and data for the experiments are available at
https://github.com/aalto-speech/dbca.
Comment: To appear at the GenBench Workshop at EMNLP 2023
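The divergence measure at the heart of a DBCA-style split can be sketched as a Chernoff-coefficient-based distance between two normalized frequency distributions (the exact weighting and the choice of the exponent follow the DBCA framework; the counts below are toy values, not from Europarl):

```python
import numpy as np

def divergence(p, q, alpha=0.5):
    """Chernoff-style divergence 1 - sum_k p_k^a * q_k^(1-a) between two
    normalized frequency distributions: 0 for identical, up to 1 for disjoint."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - np.sum(p**alpha * q**(1 - alpha))

# Toy dependency-relation counts in a candidate train split and test split.
train_counts = [40, 30, 20, 10]
test_counts  = [10, 20, 30, 40]

print(divergence(train_counts, train_counts))  # identical distributions: 0
print(divergence(train_counts, test_counts))   # divergent splits: > 0
```

A split is then searched for that keeps the atom (individual relation) distributions similar while making the compound (relation combination) distributions divergent, which is what forces the test set to require compositional generalisation.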